Anti-patterns and common pitfalls

Startups often adopt observability under tight time and budget pressures, which makes it easy to fall into patterns that feel expedient in the moment but become costly and fragile over time.

These anti-patterns were derived from startup-focused experience and customer insights, but they are broadly applicable to companies of all sizes.

Treating observability as a one-time initiative

Positioning observability as a finite project with a defined end date leads to stale dashboards, misaligned or silent alarms, and incomplete coverage as new services, environments, and AWS accounts are introduced. Observability must be managed as a persistent capability, with regular reviews and iterative enhancements tied to architectural evolution, deployment patterns, and changing business and compliance requirements.

Not following a staged crawl-walk-run approach

Designing a highly complex observability stack early on—for example, multi-region, multi-tenant telemetry pipelines with extensive custom enrichment and routing—introduces significant operational overhead and cognitive load without proven business value. Startups should first establish a minimal but robust baseline of metrics, logs, traces, and service-level alerts, then progressively introduce advanced capabilities as workload complexity and traffic justify the additional investment. 
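
As a concrete starting point, a crawl-stage baseline can be as small as a handful of service-level alarms defined in code. The following Python sketch uses boto3 to create a p99 latency alarm; the namespace, metric name, threshold, and SNS topic ARN are illustrative assumptions that you would replace with your own values.

# Minimal service-level alert for a crawl-stage baseline (sketch).
# The namespace, metric name, threshold, and SNS topic ARN are placeholders.
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="checkout-api-p99-latency-high",
    Namespace="MyStartup/CheckoutApi",          # hypothetical custom namespace
    MetricName="Latency",
    ExtendedStatistic="p99",
    Period=60,
    EvaluationPeriods=5,
    Threshold=500.0,                            # milliseconds; tune to your SLO
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:oncall-alerts"],
)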

Collecting high-volume telemetry without clear objectives

Ingesting all logs, metrics, and traces at full fidelity, without defined observability use cases, drives excessive cardinality, degraded query performance, and high storage and ingestion costs, especially in high-throughput AWS workloads.

Startups should define explicit observability objectives, scope telemetry collection to those objectives, and apply sampling, aggregation, and filtering strategies to balance monitoring depth with performance and cost. For example, reducing or filtering out nonessential labels (such as granular request IDs or dynamic user identifiers) helps control cardinality. This not only lowers ingestion and storage costs but also speeds up queries and simplifies dashboards, leading to more sustainable observability as your system scales.
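
The following Python sketch shows what scoping telemetry to objectives might look like with the OpenTelemetry SDK: head-based trace sampling keeps a fraction of requests, and a metric view allow-lists low-cardinality attributes so labels such as request IDs and user identifiers never reach the backend. The instrument name and attribute keys are illustrative assumptions.

# Sketch: head sampling plus attribute allow-listing with the OpenTelemetry SDK.
from opentelemetry import metrics, trace
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import (
    ConsoleMetricExporter,
    PeriodicExportingMetricReader,
)
from opentelemetry.sdk.metrics.view import View
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Keep roughly 10 percent of traces instead of ingesting every request.
trace.set_tracer_provider(
    TracerProvider(sampler=ParentBased(TraceIdRatioBased(0.10)))
)

# Only allow low-cardinality dimensions on request metrics; high-cardinality
# labels such as request_id or user_id are dropped before export.
request_view = View(
    instrument_name="http.server.requests",
    attribute_keys={"service", "environment", "region", "route", "status_code"},
)

metrics.set_meter_provider(
    MeterProvider(
        metric_readers=[PeriodicExportingMetricReader(ConsoleMetricExporter())],
        views=[request_view],
    )
)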

Premature lock-in to a single observability vendor

Coupling instrumentation libraries, data schemas, and runbooks to a single observability vendor in the early stages increases migration risk and limits cost and architecture flexibility as data volumes grow.

To keep technical and economic options open, startups should begin with managed services that follow open standards such as OpenTelemetry for instrumentation and data transport, adopt portable telemetry schemas, and retain the ability to send data to multiple or alternate backends. This flexibility allows easy shifts to cost-effective options early on and makes it simpler to redesign, tier, or diversify observability tooling as scale and budgets change.

By emitting metrics, logs, and traces with OpenTelemetry SDKs and exporters, a service can send the same telemetry stream to Amazon CloudWatch or Application Signals today and, if needed, to another backend later by changing collector or exporter configuration rather than application code. For example, a checkout service instrumented with OpenTelemetry can send its data to an AWS Distro for OpenTelemetry (ADOT) collector that fans out to CloudWatch for day-to-day operations and to an alternate endpoint, such as Amazon S3, for specialized long-term storage. This allows the startup to evolve its observability architecture without locking into vendor APIs or undertaking large-scale re-instrumentation.
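
A minimal instrumentation sketch for such a checkout service might look like the following Python code; the service name, collector endpoint, and span attributes are assumptions for illustration. The application depends only on the OpenTelemetry SDK and the local collector endpoint, so changing backends is a collector configuration change rather than a code change.

# Sketch: vendor-neutral instrumentation for a hypothetical checkout service.
# Fan-out to CloudWatch, S3, or another backend is decided in the ADOT
# collector configuration, not in this code.
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

provider = TracerProvider(
    resource=Resource.create({"service.name": "checkout-service"})
)
provider.add_span_processor(
    BatchSpanProcessor(
        OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True)
    )
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

def process_checkout(order_id: str) -> None:
    # The same span data is produced regardless of which backend the
    # collector exports to.
    with tracer.start_as_current_span("process_checkout") as span:
        span.set_attribute("app.order_id", order_id)
        # ... business logic ...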

Adopting a tool-centric rather than culture-centric observability model

Simply enabling services and features such as Amazon CloudWatch, AWS X-Ray, or third-party APM (Application Performance Monitoring) integrations, without engineering teams actively instrumenting code paths and using telemetry in their workflows, does not yield effective observability. Engineering teams must incorporate observability into development and operations practices—defining and owning health signals, embedding dashboards and alerts into incident response and runbooks, and using telemetry to inform design, capacity planning, and post-incident reviews. 

Absence of governance for telemetry and metadata standards

Allowing each team to independently define metric names, label sets, log formats, and trace attributes produces fragmented datasets that are difficult to join, query, and correlate across services and environments. Organizations should establish and enforce telemetry governance, including standardized naming conventions, required dimensions (such as service, environment, region, and tenant), and shared schemas implemented via common libraries and templates.
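
One lightweight way to implement such governance is a small shared library that every service uses to build its telemetry resource, so required dimensions cannot be omitted. The following Python sketch illustrates the idea; the StandardDimensions fields and attribute names are assumptions rather than an established schema.

# Sketch of a shared instrumentation helper that a platform team might publish
# so every service emits the same required dimensions.
from dataclasses import asdict, dataclass

from opentelemetry.sdk.resources import Resource

@dataclass(frozen=True)
class StandardDimensions:
    service: str
    environment: str   # e.g. "dev", "staging", "prod"
    region: str        # e.g. "us-east-1"
    tenant: str

def standard_resource(dims: StandardDimensions) -> Resource:
    """Build an OpenTelemetry Resource that carries the required dimensions.

    Raising on missing values prevents teams from shipping telemetry that
    cannot be joined or correlated later.
    """
    missing = [key for key, value in asdict(dims).items() if not value]
    if missing:
        raise ValueError(f"Missing required telemetry dimensions: {missing}")
    return Resource.create(
        {
            "service.name": dims.service,
            "deployment.environment": dims.environment,
            "cloud.region": dims.region,
            "tenant.id": dims.tenant,
        }
    )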

Neglecting customer-centric and user experience indicators

Focusing primarily on infrastructure-level signals such as CPU, memory, and disk metrics while omitting user-centric and business KPIs obscures the actual customer impact of incidents. For example, an API may appear healthy at the host level while customers experience degraded journeys due to elevated latency, timeouts, or increased error rates on key flows such as checkout or onboarding in an e-commerce application. These signals should be modeled as first-class service and business-level SLOs tied to user experience.
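
As an illustration, a checkout flow could emit its own latency and error signals as first-class metrics that SLOs and alerts are then built on. The instrument names, attributes, and the submit_order helper in the following Python sketch are hypothetical.

# Sketch: recording user-facing signals for a checkout flow so they can back
# service- and business-level SLOs.
import time

from opentelemetry import metrics

meter = metrics.get_meter("checkout-experience")

checkout_latency_ms = meter.create_histogram(
    "checkout.latency", unit="ms", description="End-to-end checkout latency"
)
checkout_errors = meter.create_counter(
    "checkout.errors", description="Failed checkout attempts"
)

def submit_order(cart) -> None:
    ...  # hypothetical business logic placeholder

def complete_checkout(cart) -> None:
    start = time.monotonic()
    try:
        submit_order(cart)
    except Exception:
        checkout_errors.add(1, {"service": "checkout", "environment": "prod"})
        raise
    finally:
        checkout_latency_ms.record(
            (time.monotonic() - start) * 1000.0,
            {"service": "checkout", "environment": "prod"},
        )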

Lack of defined data retention and tiering policies

Relying on default or unbounded retention for logs and metrics leads to uncontrolled growth in storage and analytics costs and can degrade the performance of queries and dashboards over time. Startups should define tiered retention policies per telemetry class: for example, short-term high-resolution data for incident response, downsampled or aggregated metrics for long-term trend analysis, and lifecycle rules to archive or purge obsolete data in line with regulatory, operational, and cost requirements.
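
Retention and tiering decisions can be codified alongside the rest of the infrastructure. The following Python sketch uses boto3 to set a 30-day retention policy on a hypothetical CloudWatch Logs log group and a lifecycle rule on a hypothetical Amazon S3 archive bucket; the names, retention periods, and transition thresholds are illustrative only.

# Sketch: codifying tiered retention with boto3. Align the values below with
# your regulatory, operational, and cost requirements.
import boto3

logs = boto3.client("logs")
s3 = boto3.client("s3")

# Keep high-resolution application logs for 30 days for incident response.
logs.put_retention_policy(
    logGroupName="/my-startup/checkout-service",  # hypothetical log group
    retentionInDays=30,
)

# Move archived telemetry to colder storage after 90 days, purge after a year.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-startup-telemetry-archive",        # hypothetical bucket
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-then-expire-telemetry",
                "Status": "Enabled",
                "Filter": {"Prefix": "telemetry/"},
                "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
                "Expiration": {"Days": 365},
            }
        ]
    },
)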